Week 2: Regression
PS 818 - Statistical Models
Anton Strezhnev
University of Wisconsin-Madison
September 8, 2025
\[
\require{cancel}
\DeclareMathOperator*{\argmin}{arg\,min}
\]
\[
\DeclareMathOperator*{\argmax}{arg\,max}
\]
Previously
Estimands, Estimators, Estimates
Properties of (frequentist) estimators
Bias
Variance
Consistency
How to articulate uncertainty
What does it mean to say that your 95% confidence interval is \([0.40, 0.60]\) ?
How does it connect to hypothesis testing?
This week
Linear regression and the “Best Linear Predictor”
Classical statistical theory for OLS
When is it unbiased/consistent
How do we do inference?
Gauss-Markov assumptions
Introduction to likelihood inference
Regression
Rather than just estimating a population mean, we are more typically interested in some population conditional expectation \(\mathbb{E}[Y|X]\)
\(Y_i\) : Outcome/response/dependent variable
\(X_i\) : Vector of regressor/independent variables
“How does the expected value of \(Y\) differ across different values of \(X\) ?”
Suppose we observe \(N\) paired observations of \(\{Y_i, X_i\}\) .
How do we construct a “good” estimator of \(\mathbb{E}[Y|X]\) ?
What assumptions do we have to make to get…consistency…unbiasedness…efficiency?
Example: Predicting the 2016 election
How do different parts of the U.S. vary in their election outcomes?
Knowing something about the region, can I predict what share voted for Clinton in the 2016 election?
election <- read_csv("data/prez_election_2016_vote.csv")
election$voteshare <- 100 * (election$candidatevotes / election$totalvotes)
election <- election %>% filter(!is.na(voteshare) & !is.na(pctWhiteNH))
What is the expected vote share (\(Y\) ) conditional on the share of county residents who are non-hispanic white (\(X\) ) ?
Plot the data!
Binning
We could coarsen \(X\) and just calculate means within each bin!
In fact…this is what we do with large datasets anyway…and this ends up also being a regression!
election$pctWhiteNH_bins <- cut(election$pctWhiteNH, 10)
election_means <- election %>% group_by(pctWhiteNH_bins) %>%
  summarize(voteshare = mean(voteshare, na.rm = TRUE), pctWhiteNH = median(pctWhiteNH, na.rm = TRUE))
Binning
Best Linear Predictor (BLP)
The Best Linear Predictor or population regression is a function of \(\mathbf{X}\)
\[m(x) = x^\prime\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dotsc\]
The \(\beta\) parameters minimize the expected prediction error
\[\beta = \argmin_b \ \mathbb{E}[(Y_i - X_i^\prime b)^2]\]
Note that this is a population quantity
\(m(x)\) is an estimand like any other estimand we’ve talked about
\(m(x)\) and \(\mathbb{E}[Y|X]\) might differ substantially!
Best Linear Predictor (BLP)
Is there a closed form expression for the population regression coefficients \(\beta\) ?
Yes - with some algebra (solving for the first order conditions), we get
\[\beta = (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]
Our linear projection becomes
\[m(X_i) = X_i^\prime\beta = X_i^\prime (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]
Properties
What have we assumed so far?
Finite mean/variance/covariances for \(Y\) and \(X\)
Columns of \(X\) are linearly independent (no perfect collinearity)
Projection error
What can we say about the difference between the BLP and \(Y\) ?
\[e_i = Y_i - m(X_i)\]
\[Y_i = X_i^\prime\beta + e_i\]
Projection error
We can show that the projection error is mechanically uncorrelated with the predictors
\[\begin{align*}
\mathbb{E}[X_i e_i] &= \mathbb{E}[X_i(Y_i - X_i^\prime\beta)]\\
&= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iX_i^\prime]\beta\\
&= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iX_i^\prime](\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\\
&= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iY_i] = 0\\
\end{align*}\]
Since \(X_{ik} = 1\) is typically one of the regressors (the “intercept”), this also implies \(\mathbb{E}[e_i] = 0\)
Projection error
This means that the covariance of \(X_i\) and the projection error \(e_i\) is zero!
\[\begin{align*}
Cov(X_i, e_i) = \mathbb{E}[X_ie_i] - \mathbb{E}[X_i]\mathbb{E}[e_i] = 0 - 0 = 0
\end{align*}\]
Importantly, we have derived these properties on the Best Linear Predictor and the projection error without making any assumptions on the error itself!
This is mechanically true by the way we’ve defined the BLP
Estimating the BLP
Remember, \(m(X)\) is a population quantity - can we come up with an estimator in our sample that is consistent for it?
plug-in principle - Where we have population expectations, plug in their sample equivalents
Our estimand
\[\beta = (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]
Our estimator
\[\hat{\beta} = \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n X_iY_i\bigg)\]
Often written as:
\[\hat{\beta} = (\mathbf{X}^\prime\mathbf{X})^{-1}(\mathbf{X}^\prime Y)\]
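As a quick sanity check, the matrix formula can be computed by hand and compared against lm() — a sketch on simulated data (the DGP and names like beta_hat here are illustrative, not from the lecture data):

```r
# Sketch: plug-in OLS formula on simulated data
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)        # true beta = (2, 0.5)
X <- cbind(1, x)                   # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% (t(X) %*% y)
cbind(manual = beta_hat, lm = coef(lm(y ~ x)))  # the two columns match exactly
```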
Estimating the BLP
Is \(\hat{\beta}\) consistent for \(\beta\) ?
Let’s write \(\hat{\beta}\) as a function of \(\beta\) and the projection error
\[\hat{\beta} = \beta + \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n X_ie_i\bigg) \]
We can use the weak law of large numbers
Sample averages of i.i.d. random variables converge in probability to their population expectations
\[\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime \overset{p}{\to} \mathbb{E}[X_iX_i^\prime]\]
\[\frac{1}{n} \sum_{i=1}^n X_ie_i \overset{p}{\to} \mathbb{E}[X_ie_i] = 0\]
Estimating the BLP
Plugging it all in, we have
\[\hat{\beta} \overset{p}{\to} \beta\]
Ordinary least squares is consistent for the best linear predictor under relatively mild assumptions
i.i.d. sample (can weaken this with LLNs for dependent random variables)
finite moments (true for most outcomes and regressors you’ll work with)
no perfect collinearity (you can check if this is a problem!)
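A small simulation sketch (with an assumed, deliberately nonlinear CEF) illustrates consistency for the BLP: for \(X \sim U(0,2)\) and \(Y = e^X + u\), the population BLP slope works out to \(Cov(X, e^X)/Var(X) = 1/(1/3) = 3\), and the OLS slope settles there as \(n\) grows.

```r
# Sketch: consistency of OLS for the BLP even though the CEF is nonlinear
# Illustrative DGP: X ~ Uniform(0, 2), Y = exp(X) + noise
# Population BLP slope = Cov(X, e^X) / Var(X) = 1 / (1/3) = 3
set.seed(42)
blp_slope <- function(n) {
  x <- runif(n, 0, 2)
  y <- exp(x) + rnorm(n)
  unname(coef(lm(y ~ x))[2])
}
sapply(c(100, 1000, 100000), blp_slope)  # estimates converge toward 3
```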
What have we not assumed
Linear CEF
Homoskedastic errors
Normal outcome
Asymptotic normality
In large samples, what is the distribution of \(\hat{\beta}\) ?
\[\sqrt{n}(\hat{\beta} - \beta) = \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{\sqrt{n}} \sum_{i=1}^n X_ie_i\bigg)\]
We know that \(\bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1} \overset{p}{\to} (\mathbb{E}[X_iX_i^\prime])^{-1}\)
Asymptotic normality
We can apply the CLT to the other term. We know the expectation is zero, what about the variance?
\[Var(X_ie_i) = \mathbb{E}[X_i e_i (X_i e_i)^\prime] = \mathbb{E}[e_i^2X_i X_i^\prime]\]
By the CLT, we have
\[\frac{1}{\sqrt{n}} \sum_{i=1}^n X_ie_i \overset{d}{\to} \mathcal{N}(0, \mathbb{E}[e_i^2X_i X_i^\prime])\]
Combining with our other convergence result using Slutsky’s theorem gives
\[\sqrt{n}(\hat{\beta} - \beta) \overset{d}{\to} \mathcal{N}\bigg(0, (\mathbb{E}[X_iX_i^\prime])^{-1} (\mathbb{E}[e_i^2X_i X_i^\prime]) (\mathbb{E}[X_iX_i^\prime])^{-1}\bigg)\]
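A simulation sketch (assumed heteroskedastic DGP, no intercept for simplicity) shows why the sandwich form matters: with \(X \sim \mathcal{N}(0,1)\) and \(e = |X|\,u\), the asymptotic variance is \(\mathbb{E}[X^2e^2]/\mathbb{E}[X^2]^2 = \mathbb{E}[X^4] = 3\), while the homoskedastic formula would give 1.

```r
# Sketch: sampling distribution of sqrt(n)*(beta_hat - beta) under heteroskedasticity
# Illustrative DGP: y = 2*x + |x|*u, no intercept; sandwich variance = 3
set.seed(2025)
draws <- replicate(2000, {
  n <- 500
  x <- rnorm(n)
  y <- 2 * x + abs(x) * rnorm(n)
  sqrt(n) * (unname(coef(lm(y ~ x - 1))) - 2)
})
var(draws)  # close to 3; the naive homoskedastic formula would say 1
```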
Variance estimation
Again, we have a bunch of population expectations, let’s plug in their sample equivalents!
Our estimand
\[Var(\hat{\beta}) = \frac{1}{n} (\mathbb{E}[X_iX_i^\prime])^{-1} (\mathbb{E}[e_i^2X_i X_i^\prime]) (\mathbb{E}[X_iX_i^\prime])^{-1}\]
Our estimator
\[\widehat{Var(\hat{\beta})} = \frac{1}{n} \bigg(\frac{1}{n} \sum_{i=1}^n X_i X_i^\prime \bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n \hat{e_i}^2 X_i X_i^\prime \bigg)\bigg(\frac{1}{n} \sum_{i=1}^n X_i X_i^\prime \bigg)^{-1}\]
Variance estimation
This is often known as the “sandwich” estimator
\[\widehat{Var(\hat{\beta})} = (\mathbf{X}^\prime\mathbf{X})^{-1} (\mathbf{X}^\prime \hat{\Sigma} \mathbf{X}) (\mathbf{X}^\prime\mathbf{X})^{-1}\] where \(\hat{\Sigma} = \text{diag}(\hat{e}_1^2, \hat{e}_2^2, \dotsc, \hat{e}_n^2)\)
It’s also referred to as the heteroskedasticity-consistent robust variance estimator
We have made no assumptions on the variance of \(e_i\) – we’re just plugging in the regression residuals.
BLP vs. Binning
Inference
Let’s actually calculate the “robust” standard errors by hand
linreg <- lm(voteshare ~ pctWhiteNH, data = election)
summary(linreg)
Call:
lm(formula = voteshare ~ pctWhiteNH, data = election)
Residuals:
Min 1Q Median 3Q Max
-35.82 -8.31 -0.85 7.20 45.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.8083 0.9059 74.9 <2e-16 ***
pctWhiteNH -0.4617 0.0112 -41.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.2 on 3112 degrees of freedom
Multiple R-squared: 0.354, Adjusted R-squared: 0.353
F-statistic: 1.7e+03 on 1 and 3112 DF, p-value: <2e-16
Inference
Get the \(\mathbf{X}\) matrix
X <- model.matrix(voteshare ~ pctWhiteNH, data = election)
head(X)
(Intercept) pctWhiteNH
1 1 77.2
2 1 83.5
3 1 46.8
4 1 75.0
5 1 88.9
6 1 21.9
Inference
e <- residuals(linreg)
bread <- solve(t(X) %*% X)
meat <- t(X) %*% diag(e^2) %*% X
hc0 <- bread %*% meat %*% bread
hc0
(Intercept) pctWhiteNH
(Intercept) 1.0707 -0.012342
pctWhiteNH -0.0123 0.000149
Inference
Let’s get the standard errors
sqrt(diag(hc0))
(Intercept) pctWhiteNH
1.0348 0.0122
Compare to the classical ones
(Intercept) pctWhiteNH
0.9059 0.0112
You can get the robust variance estimator using lm_robust in estimatr
lm_robust(voteshare ~ pctWhiteNH, data = election, se_type = "HC0")
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 67.808 1.0348 65.5 0.00e+00 65.779 69.837 3112
pctWhiteNH -0.462 0.0122 -37.9 1.31e-258 -0.486 -0.438 3112
Classical regression
We’ve shown that the OLS estimator is consistent and asymptotically normal for the Best Linear Predictor
What about estimating the conditional expectation function \(\mathbb{E}[Y|X]\)?
To do inference on the CEF, we assume that it’s equal to the BLP!
Assumption 1 : Linearity
\[Y_i = X_i^\prime\beta + \epsilon_i\]
Assumption 2 : Strict exogeneity of the errors
\[\mathbb{E}[\epsilon | \mathbf{X}] = 0\]
What does this get us?
Finite-sample properties of OLS (unbiasedness in addition to consistency!)
When is linearity always true ?
When we have a fully-saturated model!
Classical regression
Assumption 3 : No perfect collinearity
\(\mathbf{X}^{\prime}\mathbf{X}\) is invertible
\(\mathbf{X}\) has full column rank
Again, when does this fail? How can we check?
Classical regression
Under assumptions 1-3, our OLS estimator \(\hat{\beta}\) is also unbiased for \(\beta\)
Let’s do a quick proof for unbiasedness
\[\begin{align*}\hat{\beta} &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\\
&= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}(\mathbf{X}\beta + \epsilon))\\
&= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{X})\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\\
&= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)
\end{align*}\]
Then we can obtain the conditional expectation of \(\mathbb{E}[\hat{\beta} | \mathbf{X}]\)
\[\begin{align*} \mathbb{E}[\hat{\beta} | \mathbf{X}] &= \mathbb{E}\bigg[\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \bigg| \mathbf{X} \bigg]\\
&= \mathbb{E}[\beta | \mathbf{X}] + \mathbb{E}[(\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) | \mathbf{X}]\\
&= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime} \mathbb{E}[\epsilon | \mathbf{X}]\\
&= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}0\\
&= \beta
\end{align*}\]
Classical regression
Lastly, by law of total expectation
\[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\mathbb{E}[\hat{\beta}|\mathbf{X}]]\]
Therefore
\[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\beta] = \beta\]
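A simulation sketch (assumed linear DGP) makes the unbiasedness claim concrete — the average of \(\hat{\beta}_1\) across repeated samples sits at the true value even with a small \(n\):

```r
# Sketch: unbiasedness of OLS when the CEF really is linear (illustrative DGP)
set.seed(1)
beta1_draws <- replicate(2000, {
  x <- rnorm(50)
  y <- 1 + 2 * x + rnorm(50)   # true slope = 2
  unname(coef(lm(y ~ x))[2])
})
mean(beta1_draws)  # approximately 2, even at n = 50
```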
Note that the assumptions here are slightly stronger than what we needed for consistency for the BLP
But what have we not assumed?
Anything about the distribution of the errors (aside from finite 1st/2nd moments)
Classical regression
Assumption 4 - Spherical errors
\[Var(\epsilon | \mathbf{X}) = \begin{bmatrix}
\sigma^2 & 0 & \cdots & 0\\
0 & \sigma^2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix} = \sigma^2 \mathbf{I}\]
Benefits
Simple, unbiased estimator for the variance of \(\hat{\beta}\)
Completes Gauss-Markov assumptions \(\leadsto\) OLS is BLUE (Best Linear Unbiased Estimator)
Drawbacks
Homoskedasticity is almost never plausible in practice – error variances typically differ across values of \(X\)
Classical regression
Under spherical errors, the variance formula simplifies
\[Var(\hat{\beta} \mid \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1} (\sigma^2 \mathbf{X}^\prime \mathbf{X}) (\mathbf{X}^\prime\mathbf{X})^{-1}\] \[Var(\hat{\beta} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}^\prime\mathbf{X})^{-1}\]
All we need is an estimator for \(\sigma^2\)
A consistent one is just the sample variance of the residuals
Can do better by including a “degrees of freedom” correction – the within-sample variance is always going to under-estimate the population variance
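The degrees-of-freedom-corrected estimator \(\hat{\sigma}^2 = \frac{1}{n-k}\sum_i \hat{e}_i^2\) is exactly what lm() uses — a sketch on simulated (illustrative) data:

```r
# Sketch: classical variance estimator with the n - k degrees-of-freedom
# correction, checked against R's built-in vcov()
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)
X <- model.matrix(fit)
sigma2_hat <- sum(residuals(fit)^2) / (n - ncol(X))  # divide by n - k, not n
V_classical <- sigma2_hat * solve(t(X) %*% X)
all.equal(V_classical, vcov(fit), check.attributes = FALSE)  # TRUE
```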
Classical regression
Assumption 5 - Normality of the errors
\[\epsilon | \mathbf{X} \sim \mathcal{N}(0, \sigma^2)\]
Not necessary even for Gauss-Markov assumptions
Not needed to do asymptotic inference on \(\hat{\beta}\)
Why? Central Limit Theorem!
Benefits?
Finite-sample inference (no reliance on asymptotics for hypothesis tests)
Inference for predictions from the model.
Drawbacks
Again, almost never true unless our outcome is a sum or mean of a large number of i.i.d. random variables
Regression review
What do we need for OLS to be consistent for the “best linear approximation” to the CEF?
What do we need for OLS to be consistent and unbiased for the conditional expectation function?
Truly linear CEF
But still no assumptions about the outcome distribution!
What do we need to do inference on \(\hat{\beta}\) ?
We almost never assume homoskedasticity because “robust” SE estimators are ubiquitous
Even some forms of error correlation are permitted (“cluster” robust SEs)
Sample sizes are usually large enough where Central Limit Theorem implies a normal sampling distribution is a reasonable approximation.
Partial regression
It can be difficult to understand the mechanics of OLS with many regressors.
With a single regressor, we had a nice expression for the coefficient on \(X\)
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
\[\hat{\beta}_1 = \frac{\widehat{Cov}(Y_i, X_i)}{\widehat{Var}(X_i)}\]
Can we get something like this in the multivariate case?
\[Y = \mathbf{X}\beta + \epsilon\]
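The bivariate covariance-over-variance formula is easy to verify numerically — a sketch on simulated (illustrative) data:

```r
# Sketch: the bivariate OLS slope equals sample Cov(Y, X) / Var(X)
# (the n - 1 denominators cancel, so the match is exact)
set.seed(99)
x <- rnorm(100)
y <- 3 - 1.5 * x + rnorm(100)
c(ratio = cov(y, x) / var(x), lm = unname(coef(lm(y ~ x))[2]))
```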
Partial regression
With multiple predictors, we can write the regression coefficient on any of our regressors in terms of partial regressions .
Partition \(\mathbf{X}\) into two sets of covariates \(\mathbb{X}_1\) and \(\mathbb{X}_2\)
\[Y = \mathbb{X}_1\beta_1 + \mathbb{X}_2\beta_2 + \epsilon\]
The Frisch-Waugh-Lovell theorem states that the OLS estimate \(\hat{\beta}_1\) can be recovered by:
Obtaining \(Y^{\perp \mathbb{X}_2}\) , the residuals from a regression of \(Y\) on \(\mathbb{X}_2\)
Obtaining \(\mathbb{X}_1^{\perp \mathbb{X}_2}\) , the residuals from a regression of \(\mathbb{X}_1\) on \(\mathbb{X}_2\)
Regressing \(Y^{\perp \mathbb{X}_2}\) on \(\mathbb{X}_1^{\perp \mathbb{X}_2}\)
Core theoretical mechanics of many recent methods papers!
Cinelli and Hazlett (2020) sensitivity analysis
Much of the “new DiD” literature (esp. Goodman-Bacon decomposition, Wooldridge’s paper on “two-way Mundlak”)
Partial regression
Suppose I wanted to “control” for state fixed effects
linreg_state <- lm(voteshare ~ pctWhiteNH + as.factor(state), data = election)
tidy(linreg_state) %>% filter(term == "pctWhiteNH")
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 pctWhiteNH -0.650 0.0106 -61.5 0
Partial regression
Regression of \(Y\) on the fixed effects
election$partial_y <- residuals(lm(voteshare ~ as.factor(state), data = election))
Regression of \(X\) on the fixed effects
election$partial_x <- residuals(lm(pctWhiteNH ~ as.factor(state), data = election))
tidy(lm(partial_y ~ partial_x, data = election)) %>% filter(term == "partial_x")
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 partial_x -0.650 0.0105 -62.0 0
Partial regression
With fixed effects, partialing is equivalent to de-meaning \(Y\) and \(X\) within unit
Sometimes called the “Mundlak” regression
voteshare_means <- election %>% group_by(state) %>% summarize(voteshare_state = mean(voteshare))
pctWhiteNH_means <- election %>% group_by(state) %>% summarize(pctWhiteNH_state = mean(pctWhiteNH))
election <- election %>% left_join(voteshare_means, by = "state")
election <- election %>% left_join(pctWhiteNH_means, by = "state")
election$demeaned_y <- election$voteshare - election$voteshare_state
election$demeaned_x <- election$pctWhiteNH - election$pctWhiteNH_state
tidy(lm(demeaned_y ~ demeaned_x, data = election)) %>% filter(term == "demeaned_x")
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 demeaned_x -0.650 0.0105 -62.0 0